Text Visualization in Social Science
We will start with basic data visualization in R, focusing on ggplot2. Kieran Healy's *Data Visualization: A Practical Introduction* is a good resource for learning basic visualization. You can follow his book website and install his package for learning purposes.
After we are familiar with basic data visualization in R, we will switch to visualizing texts using these basic techniques. We do not cover other, fancier visualization tools; if you are interested in those, you can learn more by checking shiny, plotly, etc.
There are a couple of books you should read: *R for Data Science* and the ggplot2 cookbook. For disclosure, some of the example code is from *R for Data Science*.
DataViz Basics in R
We need to load some packages first.

```r
# install pacman if it is missing, then use it to load the other packages
if (!requireNamespace("pacman")) install.packages("pacman")
library(pacman)
packages <- c("tidyverse", "tidytext", "haven")
p_load(packages, character.only = TRUE)
```

Tidy data
Let us load the DoCA data into R; the input is a CSV file.
We will also use the DoCA main dataset. You can go here to download the dataset: https://web.stanford.edu/group/collectiveaction/cgi-bin/drupal/node/21.
We use the read_dta() function in haven to read the Stata file.
Let us merge the two datasets using the key identifiers title and title_doca. We will use the tidyverse function left_join(), which is similar to Stata's merge command.
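As a toy illustration of how left_join() matches rows on a key (the data frames and values here are made up, not from DoCA):

```r
library(dplyr)

# two tiny data frames sharing the key column "title_doca" (made-up values)
main <- tibble(title_doca = c("a", "b", "c"), evyy = c(1960, 1961, 1962))
nyt  <- tibble(title_doca = c("a", "c"), text = c("text a", "text c"))

# keep every row of main; attach text where the key matches, NA otherwise
merged <- left_join(main, nyt, by = "title_doca")
```

Like Stata's merge, unmatched rows from the left data frame are kept with missing values filled in.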
```r
data <- main_doca %>%
  mutate(title_doca = tolower(title)) %>%
  left_join(nyt_doca %>%
              mutate(title_doca = tolower(title_doca)),
            by = "title_doca") %>%
  filter(!is.na(text)) %>%
  select(title, title_doca, text, everything())
knitr::kable(data[1:5, 1:2], caption = "DoCA with NYT article")
```

| title | title_doca |
|---|---|
| ILLINOIS UNIT IN COURT | illinois unit in court |
| DESECRATION IN PHILADELPHIA | desecration in philadelphia |
| VICTORS AT DALLAS ACCUSE FOES OF DIRTY PLAY AND RACIAL SLURS | victors at dallas accuse foes of dirty play and racial slurs |
| CITY ACTION URGED TO COMBAT BIGOTS | city action urged to combat bigots |
| CITY ACTION URGED TO COMBAT BIGOTS | city action urged to combat bigots |
Let us create a tidy dataset, keeping text, title_doca, eventid, event year, violence, participant size, what, purpose, and whysm. Here is the codebook.
Let us do some cleaning of the main news articles.
Understanding ggplot2
If we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function(). For example, ggplot2::ggplot() tells you explicitly that we’re using the ggplot() function from the ggplot2 package.
We have to admit that ggplot2 is the most popular R graphing package in the social science community. If you don't know it, you should read *R for Data Science* and the ggplot2 cookbook.
Let us load the package if you did not load it before.
With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = tidy_data) creates an empty graph.
You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, tidy_data.
```r
# let us say we want to see the number of articles by year
# we need to compute the yearly number of articles first
# we pass tidy_data to ggplot and do a scatterplot
# we assign the ggplot object to the variable p
tidy_data %>%
  distinct(title_doca, .keep_all = TRUE) %>%
  filter(!is.na(evyy)) %>%
  group_by(evyy) %>%
  summarise(articles_n = n()) %>%
  ggplot() +
  geom_point(aes(x = evyy, y = articles_n)) -> p
p
```

Obviously this is an ugly plot. Let us do some extra work to beautify it: change the x and y axis titles and add a caption.
```r
p <- p +
  # add title, subtitle, and caption; change x and y axis labels
  labs(title = "Annual NYT Coverage of Protest in the U.S. 1960-1995",
       subtitle = "Based on a random sample (N=2000)",
       caption = "Data source: Dynamics of Collective Action and ProQuest",
       x = "Event Year",
       y = "Number of News Articles") +
  # format the title, subtitle, and caption
  theme(
    plot.title = element_text(color = "red", size = 12, face = "bold"),
    plot.subtitle = element_text(color = "blue"),
    plot.caption = element_text(color = "blue", face = "italic")
  ) +
  # relabel the x and y axes
  scale_x_continuous(breaks = seq(1960, 1995, 5), limits = c(1960, 1995)) +
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100))
p
```

I don't like the background. Let us change the theme to the classic one.
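The theme change itself is not shown in this excerpt; a minimal sketch (the plot here is rebuilt from made-up yearly counts so the snippet is self-contained):

```r
library(ggplot2)

# stand-in for the earlier scatterplot, using made-up yearly counts
toy <- data.frame(evyy = 1960:1965, articles_n = c(10, 20, 15, 30, 25, 40))
p <- ggplot(toy) + geom_point(aes(x = evyy, y = articles_n))

# theme_classic() drops the grey panel background and grid lines
p <- p + theme_classic()
```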
How about adding a smooth line?
This new plot contains the same x variable, the same y variable, and both line and points describe the same data. But they are not identical. They use a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.
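The smooth-line version is not shown in this excerpt; a sketch with the same made-up data, putting the shared mapping in ggplot() so both geoms reuse it:

```r
library(ggplot2)

# made-up yearly counts standing in for the article counts
toy <- data.frame(evyy = 1960:1965, articles_n = c(10, 20, 15, 30, 25, 40))

# same x and y mapping, two geoms: raw points plus a fitted smooth line
p_smooth <- ggplot(toy, aes(x = evyy, y = articles_n)) +
  geom_point() +
  geom_smooth()
```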
Let us try a histogram plot.
```r
tidy_data %>%
  distinct(title_doca, .keep_all = TRUE) %>%
  filter(!is.na(evyy)) %>%
  ggplot() +
  geom_histogram(aes(x = evyy), binwidth = 0.5) +
  theme_classic() -> p1
p1
```

Let us try a bar chart.
```r
tidy_data %>%
  distinct(title_doca, .keep_all = TRUE) %>%
  filter(!is.na(evyy)) %>%
  ggplot() +
  # geom_bar() counts rows per x value; binwidth applies to geom_histogram()
  geom_bar(aes(x = evyy)) +
  theme_classic() -> p2
p2
```

I am tired of graphing the number of articles by year. Let us try purpose. We only care about purposes with more than two articles.
```r
p3 <- tidy_data %>%
  mutate(purpose = tolower(purpose)) %>%
  filter(purpose != "") %>%
  group_by(purpose) %>%
  summarise(purpose_n = n()) %>%
  filter(purpose_n > 2) %>%
  ggplot(aes(x = purpose, y = purpose_n)) +
  geom_bar(stat = "identity")
p3
```

Totally a mess. You can flip the x and y axes. Let us switch our x axis to the y axis.
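The flipped version is not shown in this excerpt; one common way is coord_flip() (the purpose labels and counts below are made up for illustration):

```r
library(ggplot2)

# made-up purpose counts standing in for the summarised data
toy <- data.frame(purpose = c("civil rights", "peace", "labor"),
                  purpose_n = c(40, 25, 10))

# coord_flip() puts purpose on the y axis so long labels stay readable
p3_flipped <- ggplot(toy, aes(x = purpose, y = purpose_n)) +
  geom_bar(stat = "identity") +
  coord_flip()
```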
You can check the ggplot cookbook for more details
Understanding ggplot2 supplement
There are a lot of ggplot2-related packages you can use to visualize your data, for instance gganimate, ggnet, ggdendro, ggthemes, ggpubr, plotly, patchwork, ggridges, ggmap, ggrepel, ggradar, ggcorrplot, and GGally.
Let us install them all. You should spend some time exploring these packages. If some of them cannot be installed from CRAN, try devtools::install_github().
```r
ggpackages <- c("gganimate", "ggdendro", "ggthemes", "ggpubr", "plotly",
                "patchwork", "ggridges", "ggmap", "ggrepel", "ggradar",
                "ggcorrplot", "GGally")
p_load(ggpackages, character.only = TRUE)
# devtools::install_github("ropensci/plotly")
# devtools::install_github("briatte/ggnet")
```

Let us try plotly.
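The plotly example itself is not shown in this excerpt; a minimal sketch, converting a static ggplot object into an interactive one with ggplotly() (toy data):

```r
library(ggplot2)
library(plotly)

# a small static scatterplot (made-up values)
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))
p <- ggplot(toy, aes(x, y)) + geom_point()

# ggplotly() turns the ggplot into an interactive htmlwidget
ip <- ggplotly(p)
```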
TextViz Basics in R
We use tidytext with ggplot2 to do some basic text visualizations.
tidytext::unnest_tokens provides us with a function to tokenize words:

```r
unnest_tokens(tbl, output, input, token = "words",
              format = c("text", "man", "latex", "html", "xml"),
              to_lower = TRUE, drop = TRUE, collapse = NULL, ...)
```
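A tiny self-contained example of what unnest_tokens() does (the sentence is made up):

```r
library(dplyr)
library(tidytext)

docs <- tibble(doc = 1, text = "Protesters Marched in New York")

# one row per word; words are lower-cased by default (to_lower = TRUE)
tokens <- docs %>%
  unnest_tokens(output = word, input = text, token = "words")
```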
```r
library(tidytext)
library(SnowballC)
# we need to process the text data
token_data <- tidy_data %>%
  # create a unique id
  mutate(doca_id = row_number()) %>%
  # tokenize the tidy_data text field
  unnest_tokens(output = word, input = tidy_text, token = "words") %>%
  # get rid of stop words
  anti_join(tidytext::get_stopwords("en", source = "snowball")) %>%
  # do some stemming
  mutate(word_stem = wordStem(word)) %>%
  filter(word_stem != "")
```

A quick count of the word stems (`token_data %>% count(word_stem, sort = TRUE)`) shows the most frequent ones:

```
## # A tibble: 33,619 x 2
##    word_stem     n
##    <chr>     <int>
##  1 said      13616
##  2 new       10837
##  3 time       8331
##  4 york       8233
##  5 mr         7535
##  6 school     5263
##  7 student    5120
##  8 state      4678
##  9 citi       4302
## 10 polic      4257
## # … with 33,609 more rows
```
Let us make a document-term matrix first.
```r
dtm_data <- token_data %>%
  count(doca_id, word_stem, sort = TRUE) %>%
  cast_dtm(doca_id, word_stem, n)
dtm_data
```

```
## <<DocumentTermMatrix (documents: 2477, terms: 33619)>>
## Non-/sparse entries: 550593/82723670
## Sparsity           : 99%
## Maximal term length: 44
## Weighting          : term frequency (tf)
```
What are the highest tf-idf words in our documents? Let us plot them
```r
tfidf_data <- token_data %>%
  count(doca_id, word_stem, sort = TRUE) %>%
  bind_tf_idf(word_stem, doca_id, n) %>%
  arrange(-tf_idf) %>%
  group_by(doca_id) %>%
  top_n(10) %>%
  ungroup()
knitr::kable(tfidf_data[1:10, ], caption = "DoCA with NYT article, TF-IDF")
```

| doca_id | word_stem | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| 405 | beloit | 4 | 0.0816327 | 7.814803 | 0.6379431 |
| 833 | mastic | 3 | 0.0714286 | 7.814803 | 0.5582002 |
| 1578 | krishna | 8 | 0.0898876 | 5.868893 | 0.5275410 |
| 1579 | krishna | 8 | 0.0898876 | 5.868893 | 0.5275410 |
| 881 | fisk | 5 | 0.0781250 | 6.716191 | 0.5247024 |
| 720 | cheynei | 4 | 0.0666667 | 7.814803 | 0.5209869 |
| 1752 | arab | 4 | 0.1176471 | 4.413606 | 0.5192478 |
| 1243 | chees | 6 | 0.0869565 | 5.617579 | 0.4884851 |
| 2268 | bigotri | 4 | 0.1000000 | 4.770281 | 0.4770281 |
| 1115 | nader | 5 | 0.0877193 | 5.329897 | 0.4675348 |
Replicate fighting words article metrics
fighting words
Let us see whether violence influences media coverage of protest. The goal here is to compare the words in two corpora: news articles about protests with violence and news articles about protests without violence. We are trying to see which words are more likely to be associated with violence and which with nonviolence.
We use the following formula:
\[f_{kw}^{(V)}-f_{kw}^{(NV)}\] where \[f_{kw}^{(V)}=y_{kw}^{(V)}/n_k^{(V)}\] and \[f_{kw}^{(NV)}=y_{kw}^{(NV)}/n_k^{(NV)}\]
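Note that in the code below, each word's counts are normalized by its total count across both corpora (tot_count) rather than by corpus size, so fv and fnv sum to 1 for a fully observed word. A worked toy example with made-up counts:

```r
# made-up counts for one word stem: 30 occurrences in articles about
# violent protest (V) and 70 in articles about nonviolent protest (NV)
V <- 30
NV <- 70
tot_count <- V + NV

fv     <- V / tot_count   # share of occurrences in the violent corpus
fnv    <- NV / tot_count  # share in the nonviolent corpus
fv_fnv <- fv - fnv        # negative: the word leans nonviolent
```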
```r
# first we need to compute these two metrics for each word in the two corpora
metric_data <- token_data %>%
  count(word_stem, name = "tot_count", sort = TRUE) %>%
  left_join(token_data %>%
              filter(!is.na(viold)) %>%
              mutate(viold = ifelse(viold == 1, "V", "NV")) %>%
              count(viold, word_stem) %>%
              pivot_wider(names_from = viold, values_from = n, values_fill = 0),
            by = "word_stem") %>%
  mutate(
    fv = V / tot_count,
    fnv = NV / tot_count,
    fv_fnv = fv - fnv,
    weight = abs(fv_fnv)
  ) %>%
  # drop very common or very rare words
  filter(tot_count < 1500, tot_count > 50)
knitr::kable(metric_data[1:10, ], caption = "DoCA with NYT article, violence or not")
```

| word_stem | tot_count | NV | V | fv | fnv | fv_fnv | weight |
|---|---|---|---|---|---|---|---|
| special | 1499 | 1279 | 217 | 0.1447632 | 0.8532355 | -0.7084723 | 0.7084723 |
| colleg | 1482 | 1388 | 94 | 0.0634278 | 0.9365722 | -0.8731444 | 0.8731444 |
| commun | 1471 | 1257 | 210 | 0.1427600 | 0.8545207 | -0.7117607 | 0.7117607 |
| public | 1443 | 1313 | 129 | 0.0893971 | 0.9099099 | -0.8205128 | 0.8205128 |
| women | 1438 | 1350 | 88 | 0.0611961 | 0.9388039 | -0.8776078 | 0.8776078 |
| continu | 1432 | 1200 | 230 | 0.1606145 | 0.8379888 | -0.6773743 | 0.6773743 |
| dr | 1386 | 1171 | 213 | 0.1536797 | 0.8448773 | -0.6911977 | 0.6911977 |
| without | 1375 | 1179 | 196 | 0.1425455 | 0.8574545 | -0.7149091 | 0.7149091 |
| report | 1374 | 1071 | 302 | 0.2197962 | 0.7794760 | -0.5596798 | 0.5596798 |
| first | 1354 | 1154 | 200 | 0.1477105 | 0.8522895 | -0.7045790 | 0.7045790 |
Let us get the top 10 violent words and the top 10 nonviolent words.
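The object top50_words used in the labeled plot below is not defined in this excerpt; a minimal sketch, assuming it holds the stems at each extreme of fv_fnv (the stems, scores, and the cutoff n are made up here; on the real data you would raise n, e.g. to 25 per side):

```r
library(dplyr)

# stand-in for metric_data with made-up stems and scores
metric_toy <- tibble(word_stem = c("polic", "arrest", "petit", "vigil"),
                     fv_fnv = c(0.6, 0.4, -0.5, -0.7))

# assumed definition: most violence-leaning plus most nonviolence-leaning stems
top50_words <- bind_rows(
  metric_toy %>% slice_max(fv_fnv, n = 2),
  metric_toy %>% slice_min(fv_fnv, n = 2)
)
```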
Let us replicate the graph.
```r
metric_data %>%
  filter(!is.na(fv_fnv)) %>%
  ggplot(aes(x = tot_count, y = fv_fnv)) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "none")
```

```r
metric_data %>%
  filter(!is.na(fv_fnv)) %>%
  left_join(top50_words %>% transmute(word_stem, top50_words = word_stem),
            by = "word_stem") %>%
  ggplot(aes(x = tot_count, y = fv_fnv)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = top50_words)) +
  theme_bw() +
  theme(legend.position = "none")
```

Lab 5 Problem Set
We will continue to use the 2,000-article NYT DoCA news dataset for text viz. You need to train a topic model using stm.
Then you should visualize the topics using some of the extra packages we provided in the stm lab tutorial.
Send me a screenshot before Tuesday 6 PM.